Abstract: Metabolite identification is a major bottleneck in metabolomics due to the number 
and diversity of the molecules. To alleviate this bottleneck, computational methods and tools 
that reliably filter the set of candidates are needed for further analysis by human experts. 
Recent efforts in assembling large public mass spectral databases such as MassBank have 
opened the door for developing a new genre of metabolite identification methods that rely 
on machine learning as the primary vehicle for identification. In this paper we describe 
the machine learning approach used in FingerID, its application to the CASMI challenges 
and some results that were not part of our challenge submission. In short, FingerID learns 
to predict molecular fingerprints from a large collection of MS/MS spectra, and uses the 
predicted fingerprints to retrieve and rank candidate molecules from a given large molecular 
database. Furthermore, we introduce a web server for FingerID, which was applied for 
the first time to the CASMI challenges. The challenge results show that the new machine 
learning framework produces competitive results on those challenge molecules that were 
found within the relatively restricted KEGG compound database. Additional experiments 
on the PubChem database confirm the feasibility of the approach even on a much larger 
database, although room for improvement still remains. 



Metabolites 2013, 3 






Keywords: metabolite identification; molecular fingerprints; machine learning; FingerID 



1. Introduction 

Metabolomics is the science of measuring and analyzing the pool sizes of metabolites, which 
collectively define the metabolome of a biological sample [1]. Metabolomics has numerous and diverse 
applications in medicine, pharmaceutical research, nutrition, forensics, anti-doping, plant research 
and biotechnology. Due to its unparalleled sensitivity and selectivity, mass spectrometry (MS) is a 
cornerstone measurement technology in metabolomics. 

Identification of metabolites from mass spectra is a prerequisite for further biological interpretation of 
metabolomics samples and metabolic modeling [1,2]. It is also the most time-consuming and laborious 
step in a metabolomics experiment [3]. 

An MS measurement of a metabolomics sample results in a set of peaks representing the 
mass-to-charge (m/z) ratios and intensities of the different metabolites of the sample. The fact that 
the same elemental composition and the same mass-to-charge ratio can arise from various different 
structures, as well as noise, adducts and fragments, hampers the identification of metabolites from 
MS data [4]. 

Tandem mass spectrometry (MS/MS) facilitates metabolite identification by fragmenting the detected 
compound and measuring the m/z ratios of the resulting fragment ions. Querying measurement spectra 
against spectral reference databases [5] and manual curation by domain experts dominate the current 
approaches via MS/MS. 

The reference database method is reliable as long as the database contains the corresponding 
spectrum, and the query and reference spectra are measured with compatible, or ideally, identical mass 
spectrometers with closely matching operating parameters. Due to the general similarity of spectra, 
misleading false positives can occur even if the database does not contain the correct spectrum. Indeed, 
conventional methods are often only able to identify a minority of the detected compounds, as low as 
10% [6] to 30% [7]. Unsurprisingly, a recent survey posed to MS experts found metabolite identification 
as the most important bottleneck in metabolomics today [3]. 

To alleviate the shortcomings of the reference database methods, computational approaches to model 
the fragmentation processes have been undertaken. Current state-of-the-art methods are based on 
combinatorial algorithms and database searches. Computation of fragmentation trees is tackled with 
several approaches [3,8-11]. The MetFrag software filters the compound databases by the precursor 
mass of the query mass spectrum; for every candidate that passes the filter, the fragmentation is simulated 
and compared with the observed peak list [10]. SIRIUS uses the analysis of isotopic patterns to give 
additional data on the metabolite's elemental composition [11]. 

Fueled by public mass spectral databases such as MassBank [5], the use of machine learning 
now represents a promising and so far underutilized approach to improve the accuracy of metabolite 
identification and to decrease the burden of manual tuning of metabolite identification methods. A 
machine learning approach for metabolite identification through molecular fingerprints was very recently 
introduced [12]. FingerID [13] relies on a two-step scheme. Instead of directly learning a mapping from 
the spectrum to the metabolite, a set of characterizing fingerprints of the metabolite is first predicted from 
its tandem mass spectrum using a kernel-based approach. The fingerprint prediction model is learned 
from a large set of tandem mass spectra obtained from the public mass spectral database MassBank [5]. 
In the next step, the predicted fingerprints are matched against a molecular database such as KEGG [14] 
or PubChem [15] to obtain a list of candidate metabolites. The metabolite identification model thus 
generalizes to metabolites that are not present in reference spectral databases. Due to the machine 
learning approach, data from any type of mass spectrometer are supported. 

In this paper, we explain the approaches we used in the Critical Assessment of Small Molecule 
Identification (CASMI) [16] challenge, the first small molecule identification challenge in the 
computational mass spectrometry community. In Section 2, the main methods underlying the FingerID 
framework [12] and the new user interfaces are presented. In Section 3, the experimental data and 
methods are described, followed by results in Section 4. Section 5 concludes the article with future work. 

2. Metabolite Identification through FingerID 

In this section, the FingerID metabolite identification framework [12] is described. An overview of 
the framework is shown in Figure 1. It consists of two main modules: 

• A molecular fingerprint prediction module that relies on support vector machines (SVM) equipped 
with a probability product kernel representation of mass spectra. 

• A molecule scoring and ranking module that uses the predicted fingerprints to retrieve the best 
matching candidate molecules from a molecular database such as KEGG or PubChem. 



Figure 1. The FingerID framework. 

We describe these modules in the following. In addition, we outline the web server running FingerID 
as well as the software distribution package available for download. 

2.1. Fingerprint Prediction through SVM 

We define a mass spectrum $\chi = \{x(1), \dots, x(\ell_\chi)\}$ as a set of $\ell_\chi$ peaks 
$x(k) = (\mu(k), \iota(k)) \in \mathbb{R}^2$, $k = 1, \dots, \ell_\chi$, consisting of the peak mass $\mu(k)$ 
and the normalized peak intensity $\iota(k)$. Our goal is to learn a mapping between the mass spectra 
$\chi \in \mathcal{X}$ and a set of $m$ molecular fingerprints $y = (y_j)_{j=1}^{m} \in \{+1, -1\}^m$. 
The fingerprints encode molecular properties with the value $y_j = +1$ denoting the presence of the 
$j$th property in the corresponding molecule. 

We estimate the mapping $f : \mathcal{X} \to \{+1, -1\}^m$ using support vector machines (SVM) [17]. For each 
fingerprint, a separate SVM model 

$$y_j(\chi) = \operatorname{sign}\left( \sum_{i=1}^{n} \alpha_{ij}\, y_{ij}\, K(\chi, \chi_i) \right) \qquad (1)$$

was built using a training set $\{(\chi_1, y_1), \dots, (\chi_n, y_n)\}$ of tandem mass spectra of metabolites with their 
associated fingerprints. In Equation (1), the coefficients $\alpha_{ij} > 0$ are dual variables denoting support 
vectors, that is, training examples $\chi_i$ that have margin less than or equal to unity in the model of the $j$th fingerprint. 
The kernel $K$ measures the pairwise similarities of the spectra (see next subsection), and $y_{ij}$ denotes the 
presence or absence of the $j$th fingerprint in the $i$th training molecule. 
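The decision rule of Equation (1) can be illustrated with a minimal sketch. It assumes that the dual coefficients for one fingerprint have already been obtained by SVM training; the function name `predict_fingerprint_bit`, the toy scalar "spectra" and the toy kernel are hypothetical and serve only to show how the kernel expansion is evaluated. Mapping a sum of exactly zero to +1 is also an assumption of this sketch.

```python
import math
from typing import Callable, Sequence, TypeVar

S = TypeVar("S")  # type of a spectrum representation

def predict_fingerprint_bit(
    x: S,
    train_spectra: Sequence[S],
    alphas: Sequence[float],
    labels: Sequence[int],
    kernel: Callable[[S, S], float],
) -> int:
    """Evaluate sign(sum_i alpha_i * y_i * K(x, x_i)) for one fingerprint."""
    total = sum(
        a * y * kernel(x, xi)
        for a, y, xi in zip(alphas, labels, train_spectra)
    )
    # Assumption of this sketch: a sum of exactly zero is mapped to +1.
    return 1 if total >= 0 else -1

# Toy usage: scalars stand in for spectra, a Gaussian stands in for K.
toy_kernel = lambda a, b: math.exp(-(a - b) ** 2)
bit = predict_fingerprint_bit(0.2, [0.0, 5.0], [1.0, 1.0], [+1, -1], toy_kernel)
```

In the full framework one such model is evaluated per fingerprint, so a predicted fingerprint vector is obtained by repeating the call with the coefficients of each of the $m$ models.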



2.2. Probability Product Kernel 

Previously, two types of kernels were proposed [12] for mass spectra: the integral mass kernel and the 
probability product kernel [18]. The integral mass kernel bins the mass of each peak to the nearest integer and 
uses it as the index in the feature vector. It has a simple intuitive interpretation; however, it 
ignores the accurate mass of the peak and treats all peaks within $[\mu - 0.5, \mu + 0.5]$ as identical to mass 
$\mu$. The probability product kernel, on the other hand, assumes that the observed peak mass and intensity 
are only an approximation of the true mass and intensity, and places a 2-D Gaussian distribution over 
the observed masses and intensities, with the Gaussian noise reflecting measurement errors. In practice, the 
probability product kernel achieves consistently better results [12]. Hence, in this paper and the CASMI 
challenges, the probability product kernel was used. 

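The peaks $x(k)$ of mass spectrum $\chi$ are represented with Gaussian distributions $p_{x(k)} = \mathcal{N}(x(k), \Sigma)$ 
centered around the peak measurement and with covariance shared with all peaks 

$$\Sigma = \begin{pmatrix} \sigma_\mu^2 & 0 \\ 0 & \sigma_\iota^2 \end{pmatrix}$$

The variances $\sigma_\mu^2$ and $\sigma_\iota^2$ for the mass and the intensity, respectively, are both estimated from data and 
no covariance is assumed between them. The spectrum $\chi$ is finally represented as a mixture of its peak 
distributions $p_\chi = \frac{1}{\ell_\chi} \sum_{k=1}^{\ell_\chi} p_{x(k)}$. 

The probability product kernel $K_{\text{peaks}}$ between the peaks of two spectra $\chi, \chi'$ is now given by an 
all-against-all matching of the underlying peak distributions [12]: 

$$\begin{aligned}
K_{\text{peaks}}(\chi, \chi') = K(p_\chi, p_{\chi'}) &= \int_{\mathbb{R}^2} p_\chi(x)\, p_{\chi'}(x)\, dx
= \int_{\mathbb{R}^2} \frac{1}{\ell_\chi} \sum_{k=1}^{\ell_\chi} p_{x(k)}(x) \cdot \frac{1}{\ell_{\chi'}} \sum_{k'=1}^{\ell_{\chi'}} p_{x'(k')}(x)\, dx \\
&= \frac{1}{\ell_\chi \ell_{\chi'}} \sum_{k,k'=1}^{\ell_\chi, \ell_{\chi'}} \frac{1}{4\pi \sigma_\mu \sigma_\iota}
\exp\left( -\frac{1}{4} \big(x(k) - x'(k')\big)^{T} \Sigma^{-1} \big(x(k) - x'(k')\big) \right)
\end{aligned} \qquad (2)$$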

The kernel is computationally efficient as the integrals take a closed form that can be solved 
analytically without numerical integration. 
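The closed form of the all-against-all peak matching can be sketched as follows. The code assumes that each spectrum is a non-empty list of (mass, normalized intensity) pairs and that the two standard deviations are supplied by the caller; the function and parameter names (`ppk_peaks`, `sigma_mass`, `sigma_int`) are hypothetical. The normalizing constant follows from integrating the product of two 2-D Gaussians that share a diagonal covariance.

```python
import math
from typing import List, Tuple

Peak = Tuple[float, float]  # (mass, normalized intensity)

def ppk_peaks(
    spec_a: List[Peak],
    spec_b: List[Peak],
    sigma_mass: float,
    sigma_int: float,
) -> float:
    """Probability product kernel between two non-empty peak lists."""
    norm = 1.0 / (4.0 * math.pi * sigma_mass * sigma_int)
    total = 0.0
    for m1, i1 in spec_a:
        for m2, i2 in spec_b:
            # Squared distance scaled by the shared diagonal covariance.
            d2 = ((m1 - m2) / sigma_mass) ** 2 + ((i1 - i2) / sigma_int) ** 2
            total += norm * math.exp(-0.25 * d2)
    # Average over all peak pairs (mixture weights 1/l and 1/l').
    return total / (len(spec_a) * len(spec_b))
```

The double loop makes the cost proportional to the product of the two peak counts, and no numerical integration is involved.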

Two additional variants of the probability product kernel were considered, which differ in the way of 
deriving the underlying probability distributions: 

• Mass loss kernel $K_{\text{mloss}}$ records the difference between a fragment peak and the theoretical 
precursor peak by centering a Gaussian at the difference, giving the probability 

$$p_{x(k)} = \mathcal{N}(x(k) - \hat{x}, \Sigma)$$

where $\hat{x}$ is a dummy peak with the precursor mass $\mu_{\text{prec}}$ and the same intensity as the 
peak $x(k)$. This kernel can be interpreted as capturing putative cleaved fragments or combinations 
of them. 

• Mass difference kernel $K_{\text{diff}}$ computes the difference of all pairs of peaks and centers the Gaussian 
at the peak difference, $p_{\text{diff}}(x(k), x(k')) \sim \mathcal{N}(x(k) - x(k'), \Sigma)$. This kernel can be seen as a 
generalization of the mass loss kernel by not fixing a precursor mass but instead recording all 
possible fragmentation reactions between the peaks of two mass spectra. The kernel computation 
has quadratically higher complexity compared with the other two variants. 

The above base kernels can be combined to several types of spectral features. The experiments in [12] 
showed that the combination of the peaks and mloss kernels gives good prediction accuracy 
and shorter computation time compared with combinations involving $K_{\text{diff}}$. The combination $K_{\text{peaks+mloss}}$ was used 
in the CASMI challenge. 
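The two derived feature sets can be sketched by looking only at the mass coordinate on which the Gaussians are centered. This is an illustration under assumptions: the helper names are hypothetical, the handling of intensities is left out, and the mass difference variant is taken here over all ordered pairs of distinct peaks, which the text does not specify.

```python
from typing import List, Tuple

Peak = Tuple[float, float]  # (mass, normalized intensity)

def mass_loss_centers(peaks: List[Peak], precursor_mass: float) -> List[float]:
    """Mass coordinate of x(k) - x_hat, with x_hat at the precursor mass."""
    return [mass - precursor_mass for mass, _intensity in peaks]

def mass_difference_centers(peaks: List[Peak]) -> List[float]:
    """Mass coordinate of x(k) - x(k') for all ordered pairs with k != k'."""
    return [
        m1 - m2
        for k1, (m1, _i1) in enumerate(peaks)
        for k2, (m2, _i2) in enumerate(peaks)
        if k1 != k2
    ]

peaks = [(50.0, 0.2), (80.0, 1.0), (120.0, 0.5)]
losses = mass_loss_centers(peaks, precursor_mass=150.0)  # one value per peak
diffs = mass_difference_centers(peaks)                   # n * (n - 1) values
```

A spectrum with n peaks yields n mass loss centers but n(n - 1) difference centers, which is the source of the quadratically higher cost of the difference kernel.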

2.3. Candidate Retrieval 

Given the predicted fingerprint vector corresponding to a tandem mass spectrum, the candidate 
molecules matching these fingerprints are retrieved from a molecular database such as KEGG or 
PubChem. As a preprocessing step, one needs to generate the true fingerprint vectors of each molecule 
in the database. 

In matching the predicted fingerprints to the observed ones in the database, it is sensible to give more 
weight to fingerprints that can be predicted reliably from the mass spectrum. To implement that idea, 
FingerID uses a probabilistic model that exploits the cross-validation accuracies $(\gamma_j)_{j=1}^{m} \in [0.5, 1]^m$ of 
the fingerprints as the reliability scores. Given the reliability scores and the predicted fingerprints $\hat{y}$, the 
model assigns the Poisson-binomial probability for the fingerprint vector $y$ as follows: 

$$P_{PB}(y \mid \gamma, \hat{y}) = \prod_{j=1}^{m} \gamma_j^{\mathbf{1}[y_j = \hat{y}_j]} \, (1 - \gamma_j)^{\mathbf{1}[y_j \neq \hat{y}_j]}$$

The above can be interpreted as measuring the likelihood of a fingerprint $y$ to be the source generating 
the observations $\hat{y}$. For a molecule $M$, the probability of its fingerprint vector $y(M)$ gives its score used 
in candidate retrieval: 

$$\text{score}(M) = P_{PB}(y(M) \mid \gamma, \hat{y}) \qquad (3)$$
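A minimal sketch of the score in Equation (3) is given here, assuming fingerprints are sequences of +1/-1 values and the reliabilities are the per-fingerprint cross-validation accuracies; the function name `pb_score` is hypothetical. Each agreeing bit contributes its reliability and each disagreeing bit contributes one minus its reliability.

```python
from typing import Sequence

def pb_score(
    candidate_fp: Sequence[int],
    predicted_fp: Sequence[int],
    reliabilities: Sequence[float],
) -> float:
    """Poisson-binomial probability of a candidate fingerprint vector."""
    score = 1.0
    for y, y_hat, gamma in zip(candidate_fp, predicted_fp, reliabilities):
        # Agreement with the prediction is rewarded by gamma,
        # disagreement is penalized by (1 - gamma).
        score *= gamma if y == y_hat else (1.0 - gamma)
    return score

predicted = [+1, -1, +1]
gammas = [0.9, 0.8, 0.6]
exact_match = pb_score([+1, -1, +1], predicted, gammas)   # 0.9 * 0.8 * 0.6
one_mismatch = pb_score([+1, +1, +1], predicted, gammas)  # 0.9 * 0.2 * 0.6
```

A mismatch on a reliably predicted fingerprint lowers the score more than a mismatch on an unreliable one, which is the weighting the model is meant to provide.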

It is useful to reduce the number of candidates by filtering the molecular database by the exact mass 
of the molecule. If the mass is not known, it is estimated from the MS2 or MS1 spectrum. A small 
search window $[\mu_M - \Delta, \mu_M + \Delta]$ is set around the estimated mass $\mu_M$ of the unknown molecule, and 
the records in the molecular database that exceed the allowed mass difference are filtered out. The size 
of the candidate set depends on the width of the search window: a narrower window gives 
a smaller candidate set but carries a higher risk of missing the true molecule if the exact mass was estimated 
incorrectly. Based on the fingerprint scores in Equation (3), a ranked list is generated for the molecules 
within the mass window. 
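The filtering and ranking step can be sketched as follows. The record layout, the function names and the toy scoring function are assumptions made for illustration; in particular, converting a tolerance in ppm into a half-width in Da is shown only as one plausible way of setting the window.

```python
from typing import Callable, List, Sequence, Tuple

Record = Tuple[str, float, Sequence[int]]  # (identifier, exact mass, fingerprint)

def ppm_to_delta(mass: float, ppm: float) -> float:
    """Half-width of the mass window in Da for a tolerance given in ppm."""
    return mass * ppm * 1e-6

def rank_candidates(
    database: List[Record],
    est_mass: float,
    delta: float,
    score_fn: Callable[[Sequence[int]], float],
) -> List[Record]:
    """Keep records inside [est_mass - delta, est_mass + delta], best score first."""
    in_window = [rec for rec in database if abs(rec[1] - est_mass) <= delta]
    return sorted(in_window, key=lambda rec: score_fn(rec[2]), reverse=True)

# Toy usage with hypothetical records; the score counts matching bits.
predicted = (1, 1, 1)
matches = lambda fp: sum(1 for a, b in zip(fp, predicted) if a == b)
db = [("A", 180.0634, (1, -1, 1)), ("B", 180.0700, (1, 1, 1)), ("C", 181.0000, (1, 1, 1))]
ranked = rank_candidates(db, 180.0634, ppm_to_delta(180.0634, 200.0), matches)
```

In this toy example record C matches the predicted fingerprint perfectly but lies outside the window, so it never reaches the ranking; this is the trade-off between window width and the risk of losing the true molecule.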



2.4. FingerID Web Server 

The FingerID web server [19] was initially built for tackling the CASMI challenges. The server 
provides easy access for users who wish to try the machine learning framework for metabolite 
identification. The underlying database on the server is KEGG. A screenshot is shown in Figure 2. 

Figure 2. A screenshot of the FingerID web server. 

The FingerID web server has two modes. In the query mode, the user submits an HTTP form to the 
server with the information related to the unknown molecule, and the server sends back the result in a 
table. In the batch mode, the user writes the queries in files of a pre-defined format, compresses them into 
a package and uploads the package to the server, which sends back a package 
of results. 

Both modes share the same set of input search parameters. Exact mass is used at the beginning of 
database matching. Search PPM specifies the width of the mass window; it should be adjusted according 
to the assumed precision of the exact mass. Precursor is required for the mloss feature. Peaks is simply 
the list of masses and intensities of the MS/MS spectrum. Device type determines which trained model is used 
for molecular fingerprint prediction; the user should set it according to the type of their own mass 
spectrometer. Mode gives the ionization mode of the mass spectrometer and is used both in 
estimating the exact mass and in aligning the peaks. 

2.5. FingerID Software Distribution 

When using the FingerID web server, the user can only use the trained models provided by the web 
server administrators. The FingerID package [13] allows users to select their own training mass spectra 
to train the prediction models. The training spectra should be in MassBank [5] format. Model parameters 
and database search parameters are specified in a configuration file. Training can take from minutes 
to hours or even longer, depending on the size of the training data. More detailed instructions can be 
found on the SourceForge project home page and in the readme file of the package. 

3. Materials and Methods 

3.1. CASMI Challenge Data 

The 2012 CASMI contest [16] had four categories. Categories 1 and 2 are for high-resolution LC/MS 
data coupled with MS/MS while Categories 3 and 4 are for nominal mass GC/MS data. Categories 1 and 
3 concern the identification of the chemical formula and Categories 2 and 4 concern the identification 
of the molecular structure. FingerID utilizes MS/MS data; therefore, only Categories 1 and 2 
are relevant. 

All the challenge molecules were measured with two devices, Bruker micrOTOF-Q and LTQ-Orbitrap, 
using both APCI and ESI ionization. Some analytes are better ionized by APCI, 
and even in cases where ESI and APCI deliver comparable ionization efficiency, the analytes respond 
differently [20]. In addition, for the LTQ-Orbitrap, both collision-induced dissociation (CID) and higher 
energy collisional dissociation (HCD) appear in the challenges, and the resulting spectra for the same 
molecules are quite different. 

For those CASMI challenges not explicitly indicating the precursor mass, it is assumed to be the 
molecular ion with some adduct, usually [M+H]+. With MS1 data at hand, estimating the exact mass 
of the unknown molecule is usually straightforward. Most MS1 spectra in the CASMI challenges have 
clear isotopic distributions. If the spectrum is measured in positive mode, the exact mass of the molecule 
can be computed as the mass of the most abundant peak in MS1 minus the mass of the proton. The most abundant 
peak was taken as the precursor if it was not given. However, positive mode does not necessarily imply 
[M+H]+. Other adducts such as [M+2H]+ and [M+Na]+ are also possible. Furthermore, noise and 
measurement error may cause the difference to be larger or smaller than the mass of the proton. 
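The mass estimate for positive mode can be sketched as follows, assuming the MS1 spectrum is a list of (m/z, intensity) pairs and that the most abundant peak is the [M+H]+ ion. The function name and the optional `adduct_shift` parameter are hypothetical; the proton mass is the standard value of about 1.007276 Da.

```python
from typing import List, Tuple

PROTON_MASS = 1.007276  # Da, standard value rounded to six decimals

def estimate_neutral_mass(
    ms1_peaks: List[Tuple[float, float]],
    adduct_shift: float = PROTON_MASS,
) -> float:
    """Most abundant MS1 peak minus the adduct mass (default: [M+H]+)."""
    base_mz, _intensity = max(ms1_peaks, key=lambda peak: peak[1])
    return base_mz - adduct_shift

# Toy MS1 spectrum with hypothetical values.
ms1 = [(181.0707, 100.0), (182.0740, 6.5), (203.0526, 20.0)]
mass = estimate_neutral_mass(ms1)
```

If the most abundant peak is a different adduct, or is not related to the molecule at all, this estimate is wrong by more than any reasonable search window, so the choice of adduct should be treated as a parameter and not as a fact.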

3.2. Mass Spectral Training Data 

Due to the differences in the CASMI challenges we decided to train different fingerprint prediction 
models for the different setups. In particular, we define three variables for our datasets: instrument 
type, ionization type and fragmentation method. According to these variables, the challenge data 
can be categorized to 5 models, which are (1) LC-APCI-ITFT-CID; (2) LC-APCI-ITFT-HCD; 
(3) LC-ESI-ITFT-CID; (4) LC-ESI-ITFT-HCD and (5) LC-ESI-QTOF-CID. 

The summary of training data obtained from MassBank is listed in the first part (MS2) of Table 1. The 
molecules in Model (1) are a subset of Model (2), the molecules in Model (3) are a subset of Model (4). 
The molecules in Model (5) are quite different, with at most 5 molecules overlapping with the other 
models. Models (1) and (2) have 5 molecules also in Models (3) and (4). 



Table 1. The training datasets statistics. The number of molecules is smaller than the 
number of spectra because of the existence of mass spectra of the same molecules measured 
in different collision energies. 



MS type   Instrument type        No. of spectra   No. of molecules   Fingerprints
MS2       (1) LC-APCI-ITFT-CID              295                 65            179
          (2) LC-APCI-ITFT-HCD              882                 86            181
          (3) LC-ESI-ITFT-CID               447                224            281
          (4) LC-ESI-ITFT-HCD              2655                225            281
          (5) LC-ESI-QTOF-CID              1027                523            290
MS1       LC-ESI-ITFT                        41                 41
          LC-ESI-QTOF                        62                 62

An auxiliary dataset named QqQ, which contains spectra of collision energies of 10 eV 
(491 molecules), 20 eV (502 molecules), 30 eV (502 molecules), 40 eV (490 molecules) and 50 eV 
(449 molecules), is used to analyze the effect of different collision energies on the fingerprint prediction. 

3.3. Molecular Fingerprints 

OpenBabel [21] was used to generate the molecular fingerprints FP3 (55 bits), FP4 (307 bits) and 
MACCS (166 bits), altogether 528 bits. However, in each dataset many of the fingerprints are either 
present in all molecules (+1) or absent from all molecules (−1), which means they provide no information for training. 
We removed these uninformative fingerprints from each dataset. The numbers of remaining effective 
fingerprints are shown in Table 1. 
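The removal of uninformative fingerprints amounts to dropping every bit that takes a single value across the dataset. A minimal sketch, assuming the fingerprints are given as equal-length sequences of +1/-1 values and using a hypothetical function name:

```python
from typing import List, Sequence

def informative_bits(fingerprints: Sequence[Sequence[int]]) -> List[int]:
    """Indices of bits that take both values (+1 and -1) in the dataset."""
    n_bits = len(fingerprints[0])
    return [
        j for j in range(n_bits)
        if len({fp[j] for fp in fingerprints}) > 1
    ]

# Toy dataset: bit 0 is present in every molecule and is dropped.
fps = [[1, -1, 1], [1, 1, -1], [1, -1, -1]]
kept = informative_bits(fps)
```

Because the selection depends on the molecules in each dataset, the number of retained bits differs between the training sets.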

3.4. Molecular Databases 

For the CASMI challenges, the KEGG [14] compound database, which contained 11,657 molecules, was used 
as the underlying molecular database. After the CASMI submission, experiments were conducted with 
PubChem, which contains more than 30 million molecules, as the molecular database in place of KEGG. 

3.5. SVM Model Training and Evaluation 

Multiple spectra related to a single molecule are a potential source of bias in a cross-validation setting, 
if some of the spectra end up in the testing fold while others are present in the training fold. In such cases, 
the cross-validation accuracy becomes artificially high. To avoid this problem, the following stratified 
cross-validation scheme was used in training the models: all spectra related to a particular molecule 
were confined to the same cross-validation fold. Thus, in each trained model, either all spectra of a given 
molecule were present in the training data and none in the testing, or vice versa. 

The margin softness parameter $C$ of the SVM is chosen from the set $\{2^{-5}, 2^{-4}, \dots, 2^{10}\}$ 
independently for every fingerprint and training fold. 
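The fold construction and the parameter grid can be sketched as follows. The round-robin assignment of molecules to folds is an assumption of this sketch (the text only requires that all spectra of a molecule share a fold), and the names `molecule_grouped_folds` and `C_GRID` are hypothetical.

```python
from typing import Hashable, List, Sequence

# Candidate values for the margin softness parameter C: 2^-5, ..., 2^10.
C_GRID = [2.0 ** k for k in range(-5, 11)]

def molecule_grouped_folds(
    molecule_ids: Sequence[Hashable], n_folds: int
) -> List[List[int]]:
    """Assign spectrum indices to folds so that a molecule never spans two folds."""
    unique = sorted(set(molecule_ids))
    # Assumption: molecules are dealt to folds in round-robin order.
    fold_of = {mol: i % n_folds for i, mol in enumerate(unique)}
    folds: List[List[int]] = [[] for _ in range(n_folds)]
    for idx, mol in enumerate(molecule_ids):
        folds[fold_of[mol]].append(idx)
    return folds

# Toy usage: five spectra of three molecules, two folds.
ids = ["glucose", "glucose", "alanine", "caffeine", "alanine"]
folds = molecule_grouped_folds(ids, n_folds=2)
```

With this grouping, a spectrum in the testing fold never has a spectrum of the same molecule in the training folds, which is the source of the optimistic bias described above.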

4. Results and Discussion 

This section begins by reporting and analyzing the CASMI challenge results. Next, the different 
design choices in our approach and their effect on the metabolite identification performance are 
examined. Finally, the extensions and improvements to FingerID that were not used in the CASMI 
challenge are presented. 

4.1. CASMI Challenge Results 

The FingerID results submitted to the CASMI contest are shown in Table 2. The model that was used 
for each challenge to predict fingerprints is also included in this table. Some challenges can map to 
several models, such as Challenge 13 where both CID and HCD data are available. In this case, the 
model with the better cross-validation performance was used, which is Model (4) LC-ESI-ITFT-HCD. 

The results for the molecule identification (Category 2) were computed first, and the chemical formula 
identification is directly taken from molecule identification. As a result, it is more meaningful to discuss 
the results of Category 2. As expected, molecule identification proves to be a harder problem than 
chemical formula prediction. Overall, most of our failures in the challenges are due to 
the limited molecular database used for retrieval: the molecules in Challenges 3, 4, 10 and 12-17 were not in 
our version of KEGG. The exception is Challenge 11, in which the exact mass was estimated incorrectly. 

The distributions of the scores for all the candidates in 14 challenges in Category 2 are shown in 
Figure 3. One observation is that the score distributions are quite flat, indicating good discriminability 
of FingerID. Below, the results of those challenges where the correct molecules were in KEGG are 
discussed individually. 

Table 2. Absolute ranks of the correct molecules in the solution lists for the challenges in 
Categories 1 and 2 and the number of candidates. Challenges where the correct molecule is not 
in the result list are marked with "-" instead. The model used for each challenge to predict 
the fingerprints is listed in the Model row. The proportions of wrongly predicted molecular 
fingerprints are shown in the last row. 

Challenge: 1, 2, ..., 10, 11, 12, 13, 14, ...
Model: (5), (5), (5), (5), (1), (1), (1), ..., (2), (2), (4)
Category 1 rank: 4, 1, ..., 1, ..., 5, ...

Figure 3. The scores of all candidates (y-axis) for the challenges (x-axis) are shown in black 
dots where the size indicates the number of candidates sharing the same scores. The correct 
solutions are indicated with a square. For the challenges where the correct molecules were 
not in the database, the hypothetical rankings (assuming the molecules were in the database) 
of the true molecules are shown by the cross markers. 

• In Challenge 1, the correct molecule was in the fifth position out of six. The score of the correct 
molecule was rather low; the high rank is probably a consequence of KEGG not having many 
molecules with a similar molecular weight, rather than good fingerprint prediction. 

• In Challenge 2, the search ppm was set to 200 and the correct solution was obtained despite the 
30 ppm error in the original challenge data. In our version of KEGG, only three entries have a 
mass within 200 ppm of 592.1792, and only two of them had molecular fingerprints generated 
using OpenBabel. Thus, the identification reduces to choosing one of the two, and FingerID 
made the right choice. Incidentally, after correcting the 30 ppm error, FingerID still ranked the 
true molecule at the top, which surpassed other CASMI participants. 

• In Challenge 5, the correct molecule ranked fifth and had the same confidence score as the fourth 
one. If the CASMI organizers had used the rank of the score as the criterion, this would have been the 
winning entry for this challenge. 

• In Challenge 6, half of the candidates had better scores than the correct one, which means the 
molecular fingerprint prediction was not perfect. However, comparing the absolute rank to the 
other participants, this was sufficient to win this challenge. 

• In Challenge 11, the highest intensity peak in MS1 does not correspond to the molecule with an adduct. Thus, the 
exact mass of the molecule was estimated incorrectly. 

The question of how the challenge molecules differ from the training molecules and molecules in 
the database is answered by comparing the fingerprints, as shown in Appendix A. A simple similarity 
function between fingerprints is defined and histograms of the similarity scores are presented. 

4.2. Evaluation of the FingerID Framework in the CASMI Contest 

Here we study the effects of different aspects and design choices in FingerID that explain the CASMI 
results and point directions for future improvement. The robustness of the fingerprint prediction model 
and the prediction of the exact mass are first studied, followed by a discussion of how to utilize mass 
spectra measured by different collision energies. Finally, the uniqueness of fingerprints is also explored. 

4.2.1. Effect of Training Set Size on Fingerprint Prediction Reliability 

To understand how the size of training set affects the fingerprint prediction accuracy, subsamples 
of the data of gradually increasing size were generated and 10-fold stratified cross-validation (see 
Section 3.5) on each subsample was conducted. Each time one fold was picked as testing data and 
20%, 40%, 60%, 80% and all of the remaining data were extracted as training data. The resulting 
curves for cross-validation and training error are shown in Figure 4, together with the relative rank of the 
retrieved molecule. In these experiments, fingerprints for which the majority class accounts for at most 
80% are included. 

A general trend in Figure 4 is that with more data to train, lower testing error and better relative 
rank of the correct molecule were observed. An exception is the LC-ESI-QTOF dataset where it 
seems the performance after training with 60% of the data could not be improved much. For the 
LC-ESI-ITFT-HCD, the cross validation testing error approaches zero rapidly. The APCI-ITFT dataset, 
which is the combination of the APCI-ITFT-CID and the APCI-ITFT-HCD, shows that the merging 
produces slightly better performance than simply using the APCI-ITFT-CID data but slightly worse 
results than just using the APCI-ITFT-HCD data. 

Figure 4. This figure shows for different size subsamples the average cross-validation test 
error (solid lines) and training error (dashed lines) over all fingerprints, and the average 
relative rank (dotted lines) of the correct molecule in the list of retrieved candidates. 

In summary, most of the datasets seem to be large enough to give good average prediction quality 
for fingerprints. In addition, the metabolite identification performance is seen to correlate with the 
fingerprint prediction error in a clear way. However, the datasets are too small to represent the whole 
metabolite space in a balanced manner. Hence, the good results within each dataset may not translate to 
good metabolite identification results outside the region of metabolite space covered by the training data. 

4.2.2. Quality of Exact Mass Prediction 

To understand how accurate the approach of deriving the exact mass from the spectra of the unknown 
molecules is, the difference (measured in ppm) between the exact mass and the predicted exact mass 
was compared for our training datasets, as shown in Figure 5. 

Figure 5 shows that in most cases the difference between the predicted and the real exact mass of the 
molecule is less than 10 ppm, which indicates a suitable mass window width. For the CASMI contest, 
the mass window width was set to range from 200 ppm to 500 ppm even though, in hindsight, our data 
would have supported a much narrower search window. In the APCI-ITFT-CID, APCI-ITFT-HCD and 
LC-ESI-QTOF-CID datasets, there are some outliers where the exact mass prediction is off by a huge 
amount (thousands of ppm); these need to be investigated in the future. 

Figure 5. Histogram of the difference between derived mass and exact mass, measured 
in ppm, truncated at 10 ppm. Differences of less than 10 ppm account for 88%, 85%, 
100%, 100% and 90% of the cases in the five datasets, respectively. 

4.2.3. Effect of Using Multiple Collision Energies 

As observed from Table 1, many spectra of the same molecule exist in the training dataset. This is a 
result of different collision energies used in the measurement, which leads to different fragmentations of 
the precursor ions. 

To test whether mixing different collision energies in the training data has a positive or negative effect on 
fingerprint prediction, one fifth of the molecules measured at a collision energy of 30 eV in the QqQ dataset were 
taken as fixed testing data and the rest of the dataset was used for training in two setups: training with 
30 eV only and training with all collision energies. The result is shown in a scatter plot in Figure 6. 
In this experiment, only fingerprints for which the majority class accounts for at most 80% of the data 
were considered, as achieving high predictive accuracy for those is more challenging than for the more 
biased fingerprints. 

Figure 6 shows that even though the spectra resulting from different collision energies are not the 
same, integrating them into one model always improves prediction accuracy. Combining spectra from 
different collision energies instead of building a model for every single collision energy reduces model 
complexity and improves prediction accuracy. 

4.2.4. Degree of Uniqueness of Fingerprints 

Molecular fingerprints describe the selected properties of a molecule in the form of a bit vector. In 
FingerlD, the molecular fingerprint is an intermediate representation for identifying candidate molecules. 
As the ranking of the candidates is based on the fingerprint vector and the mass of the molecule, 
fingerprint vectors that are shared by large number of molecules cause the candidate lists to grow in 
size, which is not desirable. In this experiment, for each different fingerprint configuration in KEGG and 
PubChem, respectively, the duplicates in the database were counted and denoted as Nd p . We show the 
cumulative distribution of N dp (in log scale) in Figure 7. 

Fingerprints with $N_{dp} = 1$ are unique in the database. In the versions of the KEGG and PubChem 
databases used, we have 9,399 and 18,043,141 different fingerprint configurations, respectively. In 
KEGG (resp. PubChem), 7,454 (resp. 12,829,187) of these fingerprint configurations are unique, 
and 1,114 (resp. 3,282,098) fingerprint configurations have one duplicate molecule. The highest number 
of duplicates is 86 in KEGG and 2,501 in PubChem. In terms of molecules, only 64% of the molecules 
in KEGG and 43% in PubChem have unique fingerprints. 
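
The counting behind these statistics can be sketched as follows. The sketch assumes that each molecule is represented by its fingerprint as a tuple of bits and that the count for a configuration is the number of molecules sharing it, so a count of 1 means a unique fingerprint. The function names and the toy data are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple

Fingerprint = Tuple[int, ...]


def uniqueness_summary(fingerprints: List[Fingerprint]) -> Dict[str, float]:
    """Summarise how many fingerprint configurations and molecules are unique."""
    counts = Counter(fingerprints)  # configuration -> number of molecules sharing it
    n_unique = sum(1 for c in counts.values() if c == 1)
    return {
        "n_configurations": len(counts),
        "n_unique_configurations": n_unique,
        "max_count": max(counts.values()),
        # A molecule has a unique fingerprint exactly when its configuration has count 1.
        "fraction_unique_molecules": n_unique / len(fingerprints),
    }


def cumulative_counts(fingerprints: List[Fingerprint]) -> Dict[int, int]:
    """For each observed count n, the number of configurations with count <= n."""
    histogram = Counter(Counter(fingerprints).values())
    running, cumulative = 0, {}
    for n in sorted(histogram):
        running += histogram[n]
        cumulative[n] = running
    return cumulative


# Toy database (invented): five molecules, three distinct configurations.
toy_db = [(1, 0, 1), (1, 0, 1), (0, 1, 1), (1, 1, 0), (1, 0, 1)]
summary = uniqueness_summary(toy_db)
cdf = cumulative_counts(toy_db)
print(summary, cdf)
```

Plotting the cumulative counts on logarithmic axes would give a curve of the kind described for Figure 7.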

Figure 6. Fingerprint prediction accuracy on QqQ data with single collision energy data 
versus mixed collision energy data. The x-axis shows the accuracies of the models trained using 
only collision energy 30 eV and the y-axis shows the accuracies of the models trained using all 
available collision energies. 

Figure 7. Cumulative distribution of $N_{dp}$, i.e., the number of molecules having $N_{dp}$ 
duplicates or fewer. Both the x-axis and the y-axis are in log scale. 

For molecule identification, the harmful effect of duplicate fingerprints in the database 
is partly mitigated by considering the masses of the molecules: the percentages of molecules with a 
unique combination of fingerprint and mass are 75% and 73% in KEGG and PubChem, respectively. 

4.3. Extensions 

In the following, the extensions of FingerID that were not used in the CASMI contest are examined, 
namely using isotopic distribution information from the MS1 spectrum and using PubChem as the 
molecular database instead of KEGG. 

4.3.1. Isotopic Distribution Matching 

Each chemical element can have several isotopes that share the same number of protons and electrons 
but differ in the number of neutrons. The isotopes occur in nature in certain abundances. For example, 
carbon has two stable isotopes, $^{12}$C and $^{13}$C, with abundances of 98.890% and 1.110%, respectively. 

For an element E with r isotope types, a molecule consisting of l atoms of that element has 
$\binom{l+r-1}{l}$ different isotopologues caused by that element [22]. All of the isotopologues also have 
distinct abundances, which can be derived by applying multinomial probability over the isotope 
abundances [23,24]. Consequently, the theoretical mass spectrum that arises from the set of 
isotopologues can be simulated and compared with the observed spectrum. The isotopic pattern is more 
informative for metabolite identification than the mass alone. Many methods and tools have been 
published for this purpose [25-27]. 
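
As a small worked example, the number of isotopologues contributed by one element is the number of multisets of size l drawn from r isotope types, and for an element with two isotopes the multinomial abundances reduce to a binomial distribution. The sketch below uses the carbon abundances quoted above; the function names are hypothetical, and the sketch computes abundances only, not a full mass spectrum.

```python
from math import comb, isclose
from typing import List


def n_isotopologues(n_atoms: int, n_isotopes: int) -> int:
    """Number of multisets of size n_atoms drawn from n_isotopes isotope types."""
    return comb(n_atoms + n_isotopes - 1, n_atoms)


def carbon_isotopologue_abundances(n_carbons: int, p13: float = 0.0111) -> List[float]:
    """Abundance of the isotopologue with k 13C atoms, for k = 0..n_carbons.

    This is the two-isotope (binomial) special case of the multinomial model.
    """
    p12 = 1.0 - p13
    return [
        comb(n_carbons, k) * p13**k * p12 ** (n_carbons - k)
        for k in range(n_carbons + 1)
    ]


# Six carbon atoms with two isotope types give seven carbon isotopologues.
abundances = carbon_isotopologue_abundances(6)
print(n_isotopologues(6, 2), abundances[:3])
```

The abundances sum to one and fall off quickly with the number of heavy atoms, which is why the first few isotope peaks dominate the observed pattern.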

In the CASMI challenges, MS1 data were also given, and most of them contain isotopic distributions. 
This allows us to rank the candidates by matching the observed isotopic distribution against the 
isotopic distributions simulated from the chemical formulas. For the matching score, the probability 
product kernel function (2) was used. 

The fingerprint based score (3) was fused with the isotopic distribution based score to obtain the final 
ranking. Two rank aggregation methods were investigated: taking the average rank or the minimum rank 
as the combined ranking. Some molecules have the same chemical formula and thus receive 
the same isotopic distribution matching score. A third, slightly more complicated method therefore ranks 
the candidates by the isotopic distribution matching score first and, for candidates that are tied, 
reranks them by the fingerprint based score. 
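
The three combination schemes can be sketched as below. The sketch assumes that higher scores are better, that both score dictionaries cover the same candidates, that tied candidates share the best rank, and that remaining ties are broken alphabetically. These conventions, the function names and the toy scores are illustrative assumptions rather than details of the original implementation.

```python
from typing import Dict, List


def ranks_from_scores(scores: Dict[str, float]) -> Dict[str, int]:
    """Rank 1 is the highest score; candidates with equal scores share the best rank."""
    ordered = sorted(scores.values(), reverse=True)
    return {name: ordered.index(score) + 1 for name, score in scores.items()}


def average_rank(fp_scores: Dict[str, float], iso_scores: Dict[str, float]) -> List[str]:
    fp, iso = ranks_from_scores(fp_scores), ranks_from_scores(iso_scores)
    return sorted(fp, key=lambda c: ((fp[c] + iso[c]) / 2, c))


def minimum_rank(fp_scores: Dict[str, float], iso_scores: Dict[str, float]) -> List[str]:
    fp, iso = ranks_from_scores(fp_scores), ranks_from_scores(iso_scores)
    return sorted(fp, key=lambda c: (min(fp[c], iso[c]), c))


def rerank(fp_scores: Dict[str, float], iso_scores: Dict[str, float]) -> List[str]:
    """Order by isotope score first; break isotope ties with the fingerprint score."""
    return sorted(fp_scores, key=lambda c: (-iso_scores[c], -fp_scores[c], c))


# Toy scores (invented). B and C share a formula, hence the same isotope score.
fp_scores = {"A": 0.9, "B": 0.7, "C": 0.8, "D": 0.2}
iso_scores = {"A": 0.5, "B": 0.95, "C": 0.95, "D": 0.6}

by_average = average_rank(fp_scores, iso_scores)
by_minimum = minimum_rank(fp_scores, iso_scores)
by_rerank = rerank(fp_scores, iso_scores)
print(by_average, by_minimum, by_rerank)
```

In the toy example the minimum rank scheme cannot separate A, B and C, since each is ranked first by one of the two scores, whereas reranking resolves the tie between B and C with the fingerprint score.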

MS1 data were used for the LC-ESI-ITFT and LC-ESI-QTOF datasets, as shown in the last two rows of 
Table 1. Compared with MS2 data, only a few molecules have MS1 data in the database. FingerID 
was trained with the MS2 data and the MS1 data were used for isotopic distribution matching. Then the 
combining methods described above were applied to merge the two ranked lists. 

Figure 8 shows that FingerID alone achieves top 1 rank for the largest fraction of the data for both 
datasets. On the QTOF data, it shares the best performance with average rank aggregation over the whole 
range. On the ITFT data, FingerID alone is the best up to the top 5 ranks but falls behind the competing 
methods in ranking the tail of the data. Reranking the candidates that have matching isotopic 
distributions with FingerID is a better approach than minimum rank. As the CASMI challenges used the 
absolute rank of the correct molecule as the goodness criterion, FingerID alone was used to solve the 
challenges. 



Figure 8. The rank distribution of metabolites using FingerID alone, isotope matching alone 
and the different rank aggregation schemes: average rank, minrank and rerank. The higher 
the curve, the larger the proportion of highly ranked molecules. The x-axis is in log scale. 

4.3.2. Using PubChem as the Molecular Database 

In the CASMI contest, KEGG was used as the underlying database to search for candidates. As many of 
the challenge molecules were not recorded in KEGG, the method failed to identify those molecules 
correctly. After the challenge deadline, PubChem [15] was investigated as an alternative source 
of candidate molecules. As PubChem contains more than 30 million compounds, which is several orders of 
magnitude more than KEGG, the recall of molecules is improved. At the same time, however, the 
number of candidates within a mass window increases, making it harder to rank the correct molecules 
towards the top. Table 3 shows the results for Category 2 when using PubChem as the underlying 
molecular database. 

Table 3. Retrieval of candidate molecules from PubChem. The top part of the table 
corresponds to the CASMI setup except for replacing KEGG with PubChem. The middle 
part shows the results with 10 ppm mass window with PubChem. The bottom rows depict the 
best possible results that could be obtained if the optimal mass window width was known. 

The top three rows show the results obtained when searching PubChem using the same mass window 
width as was used in the CASMI challenges. It can be observed that for most of the challenges the 
correct molecules are among the retrieved candidates, with the exception of Challenge 11, in which the 
precursor mass was estimated incorrectly, and Challenge 12, where the molecule is not found in the 
version of PubChem used. The candidate lists are quite long and the absolute ranks of the correct 
molecules are typically too low to allow manual checking by a human expert. For example, the correct 
molecule in Challenge 3 is ranked 355th out of 87,514, which corresponds to the top 0.4% of the 
retrieved candidates. 

The fourth row of Table 3 shows the effect of choosing a much smaller mass window of 10 ppm, 
based on the statistics in Figure 5, which suggest that larger errors in exact mass prediction are relatively 
rare. This result shows a remarkable improvement in ranking the challenge molecules. In fact, these ranks 
could have won Challenges 3 and 10. On the other hand, in Challenge 2 the correct candidate would have 
been pruned out due to the 30 ppm error in the mass. 

In the last two rows of Table 3, the idealized case is inspected, in which the smallest ppm range that 
still keeps the correct molecule in the candidate list is sought. Ideal ppm is defined as the smallest ppm 
value that still allows the candidate list to contain the correct molecule, and ideal rank is the 
corresponding rank achieved by using the ideal ppm mass window. It can be noted that the ideal ranks are 
in many cases not much better than those achieved using the 10 ppm mass window. 
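
The mass window filter and the ideal ppm can be made concrete with a short sketch. It assumes that the ppm deviation is taken relative to the measured precursor mass and that the window is symmetric; the function names, candidate names and masses are invented for illustration.

```python
from math import isclose
from typing import Dict, List


def ppm_error(candidate_mass: float, measured_mass: float) -> float:
    """Absolute mass deviation in parts per million, relative to the measured mass."""
    return abs(candidate_mass - measured_mass) / measured_mass * 1e6


def candidates_in_window(
    masses: Dict[str, float], measured_mass: float, ppm: float
) -> List[str]:
    """Names of the candidates whose mass lies within +/- ppm of the measured mass."""
    return [name for name, m in masses.items() if ppm_error(m, measured_mass) <= ppm]


def ideal_ppm(correct_mass: float, measured_mass: float) -> float:
    """Smallest window width (in ppm) that still retains the correct molecule."""
    return ppm_error(correct_mass, measured_mass)


# Toy candidate masses (invented) around a measured mass of 200.0000.
measured = 200.0
masses = {"correct": 200.0010, "near": 200.0016, "far": 200.0100}

wide = candidates_in_window(masses, measured, ppm=300.0)
narrow = candidates_in_window(masses, measured, ppm=10.0)
best = ideal_ppm(masses["correct"], measured)
print(wide, narrow, best)
```

Narrowing the window shortens the candidate list, but only as long as the window stays at least as wide as the ideal ppm of the correct molecule; below that, the correct candidate is pruned, as happened in Challenge 2 with the 10 ppm window.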

5. Conclusions 

The prediction of molecular fingerprints from tandem mass spectra (MS/MS) using machine learning 
methods tackles the molecule identification problem in a new way. In this approach, observed MS/MS 
spectra are not directly compared with an MS spectral database or simulated MS/MS spectra. Instead, 
molecular fingerprints are predicted and then used to search a molecular database of choice. The CASMI 
contest results show that this machine learning approach is competitive with other current methods. 

This approach is modular in that the support vector machine used in this paper could be changed to 
any other machine learning approach. The set of fingerprints generated by OpenBabel can be replaced 
by others such as PubChem fingerprints. Finally, the molecular database used to retrieve candidate 
molecules can be changed flexibly. 

The presented approach still requires further development. The first issue is that the prediction is 
highly dependent on the selection of the training dataset. Using merely a few hundred molecules as 
training data cannot represent the whole relevant chemical space. Hence, some testing molecules that are 
beyond the learned model domain may be encountered. However, as the results so far suggest, relatively 
good performance can already be achieved with moderate training sets of a few hundred molecules. 
Thus, extending the approach to new regions of metabolite space via generating spectral training data 
for the sparsely populated regions should not prove an insurmountable challenge. 

The molecular fingerprint prediction also makes it necessary to study the properties and 
structure of the molecular fingerprints. Appendix A shows that many molecules share the same 
fingerprints in KEGG and PubChem, even though the 528-bit OpenBabel substructure fingerprints 
can theoretically produce $2^{528}$ fingerprint configurations. Simply replacing the OpenBabel fingerprints 
with a larger set of fingerprints (e.g., PubChem fingerprints) can remove some duplicates, but more 
detailed research on the set of fingerprints is required. 

There are several ways to refine the set of fingerprints and their predictions. First, modeling the 
dependencies among the fingerprints could improve predictive accuracy, such as learning a Bayesian 
network of fingerprints and using the learned dependency graph as the input for structured learning. 
Second, weighting the fingerprints by their ability to discriminate between molecules could help the 
ranking of candidates. 

Choosing the molecular database for candidate retrieval is the final crucial component. A small 
database produces shorter candidate lists, but there is a high chance that the correct molecule is not in 
the list unless the molecules to be identified match the database very well. A much larger database such 
as PubChem includes the correct molecule in many cases but may produce prohibitively large candidate 
lists. Using a small enough mass window can alleviate this problem to a significant degree. 

There are several further directions for improving the metabolite identification framework. First, 
FingerID does not use information on plausible fragmentation trees, which are used effectively in 
several competing systems. Second, using domain specific information about the kinds of molecules 
that are more plausible than others could improve the framework. Finally, the molecular database could 
be dispensed with if combinatorial algorithms were developed to reconstruct the molecular structure from 
the predicted molecular fingerprints. 

Appendix 
Appendix A 

This appendix contains the definition of similarity between two fingerprints and the distributions of 
similarity between the CASMI challenge molecules in Categories 1 and 2 and the training molecules 
(Figure A1) and the molecules in the KEGG compound database (Figure A2). 

Given two vectors of m molecular fingerprints, $y = (y_i)_{i=1}^{m}$ and $y' = (y'_i)_{i=1}^{m}$, the 
similarity between them, if no prior knowledge exists on the weight of each individual fingerprint, is 
defined by: 

$$s(y, y') = \frac{1}{m} \sum_{i=1}^{m} \mathbf{1}_{y_i = y'_i}$$

By this definition, two identical fingerprints give a similarity score of 1, while two totally 
different fingerprints give a similarity score of 0. 
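
A minimal sketch of this similarity, assuming that the unweighted score is the fraction of fingerprint positions on which the two binary vectors agree (which is consistent with the boundary cases stated above), could read as follows. The function name is hypothetical.

```python
from typing import Sequence


def fingerprint_similarity(y: Sequence[int], y_prime: Sequence[int]) -> float:
    """Fraction of positions at which two equally long fingerprint vectors agree."""
    if len(y) != len(y_prime) or len(y) == 0:
        raise ValueError("fingerprints must be non-empty and of equal length")
    matches = sum(1 for a, b in zip(y, y_prime) if a == b)
    return matches / len(y)


same = fingerprint_similarity([1, 0, 1, 1], [1, 0, 1, 1])
opposite = fingerprint_similarity([1, 0, 1, 1], [0, 1, 0, 0])
partial = fingerprint_similarity([1, 0, 1, 1], [1, 1, 1, 0])
print(same, opposite, partial)
```

For binary fingerprints this equals one minus the normalized Hamming distance between the two vectors.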

Figure A1. Distribution of similarity scores between challenge molecules and training data 
(used to predict the challenge). 

Figure A2. Distribution of similarity scores between challenge molecules and molecules in 
the KEGG compound database. 